Over the last decade, Spotify has become the go-to music library for music lovers worldwide. Its recommendation model, which internally combines artist popularity with content-based filtering on the genres a user likes, is itself state of the art. This analysis examines how different audio features relate to popularity and how they correlate with genre and subgenre.

Synopsis

Problem Statement

Music taste is subjective but repetitive, meaning listeners tend to prefer certain music genres over others. Genre classification is a critical task for music platforms like Spotify as it helps artists connect with new audiences. With millions of new songs released every day, automating this task is of paramount importance. However, this is only possible if genres have unique characteristics that differentiate them from one another. With this dataset we will explore the following:

  • Understanding the song preferences of the Spotify userbase
  • Approximating the song preferences of millennials, since the majority of the Spotify userbase is millennial
  • What are the key characteristics of a song genre?
  • What factors contribute to song popularity?

Use cases for Spotify:

  • Accurately classifying songs into genre categories based on song characteristics
  • Gauging the popularity of a song before its release on the platform

High Level Approach

We have two objectives in mind when evaluating the dataset: first, to understand the key characteristics of each genre, and second, to determine the key factors that contribute to song popularity.

Solution Overview:

  1. Data Dictionary Review – Familiarization with data columns, data source
  2. Specifications of the data – Data size, number of rows and columns, Time period over which it is collected
  3. Feature Datatypes – Check datatype for each feature and convert as appropriate
  4. Data Cleaning – Correcting inconsistent data by filling out missing values and smoothing out noisy data (outliers)
  5. Segregate Numerical and Categorical variables and perform Exploratory Data Analysis
  6. Exploratory Data Analysis – Summary Statistics, Correlation analysis, Visualization (boxplots and distribution plots), count plots

Our current approach partially addresses the problem. Using EDA methods, we can generate insights into:

  • Characteristics of different song genres
  • Song Popularity characteristics

The graphical techniques, combined with summary statistics, help us understand the shape of the data. They give us the intuition that each genre forms its own cluster with unique characteristics, and that with the right model we can capture those characteristics and predict the genre.

  1. A decision tree or k-nearest neighbours model on the dataset can give us accurate genre classifications.
  2. For song popularity, a decision tree regressor or XGBoost model can predict track popularity.
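The classification idea in point 1 can be sketched as follows. This is a minimal illustration using the rpart package (loaded later in this report) on a synthetic data frame: the column names mirror the dataset's audio features, but the values are invented for illustration, not drawn from the Spotify dataset.

```r
# minimal sketch: decision-tree genre classification on toy audio features
library(rpart)

set.seed(42)
n = 300
# synthetic stand-in for the real dataset; values are made up
toy_df = data.frame(
  genre        = factor(rep(c("edm", "rock", "rap"), each = n / 3)),
  danceability = c(runif(n/3, 0.6, 0.9), runif(n/3, 0.3, 0.6), runif(n/3, 0.5, 0.8)),
  energy       = c(runif(n/3, 0.8, 1.0), runif(n/3, 0.6, 0.9), runif(n/3, 0.5, 0.8)),
  speechiness  = c(runif(n/3, 0.0, 0.1), runif(n/3, 0.0, 0.1), runif(n/3, 0.2, 0.5))
)

# fit a classification tree and predict back on the training data
tree_model = rpart(genre ~ ., data = toy_df, method = "class")
predicted  = predict(tree_model, toy_df, type = "class")
mean(predicted == toy_df$genre)  # training accuracy
```

On the real dataset, `toy_df` would simply be replaced by the cleaned, scaled feature frame built in the later sections.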

Benefits for the consumer of the analysis

The analysis will be useful for Spotify, enabling it to classify songs automatically based on their audio characteristics. With millions of songs posted on the platform each day, it is crucial to classify them into genres automatically so they can be fed to the recommendation algorithm, which surfaces them in the recommendation lists of users most likely to listen to them.

Being able to predict song popularity allows Spotify to release a song to selected regions and user segments where it is likely to be most popular.
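The popularity-prediction idea could look roughly like this: a hedged sketch using the xgboost package (loaded later in this report) as a regressor. The feature names match real dataset columns, but the matrix values and the toy popularity formula are invented for illustration.

```r
# minimal sketch: XGBoost regression of a toy popularity score on audio features
library(xgboost)

set.seed(42)
# synthetic feature matrix; column names mirror real audio features
X = matrix(runif(200 * 3), ncol = 3,
           dimnames = list(NULL, c("danceability", "energy", "valence")))
# invented popularity signal: mostly driven by danceability, plus noise
y = 50 + 30 * X[, "danceability"] + rnorm(200, sd = 5)

model = xgboost(data = X, label = y, nrounds = 25,
                objective = "reg:squarederror", verbose = 0)
pred = predict(model, X)
```

The same call pattern applies to the real data, with `X` built from the scaled audio features and `y` from `track_popularity`.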

Packages Required

Following are the packages used:-

tidyverse = Data loading and manipulation; includes ggplot2 for graphics and dplyr for data manipulation
data.table = Reading large tables
gridExtra = For simultaneous display of multiple plots
corrplot = Correlation Matrix Visualization
DT = For HTML display of spotify dataset
kableExtra = For additional features on outputting tables
nnet = For Logistic Regression modeling
car = For checking multicollinearity through Variance Inflation Factor
Rtsne = Cluster grouping of audio features using TSNE algorithm
caret = Training models for hyperparameter tuning
class = For KNN modeling
e1071 = For SVM modeling
rpart = For decision tree modeling
rpart.plot = For plotting decision tree graph
randomForest = For Random Forest modeling
xgboost = For XGBoost modeling
doParallel = For parallel processing
kernlab = For hyperparameter tuning of SVM
ranger = For hyperparameter tuning of Random Forest

# loading packages
packages = c('tidyverse', 'data.table', 'gridExtra', 'corrplot', 'DT', 
             'kableExtra', 'nnet', 'car', 'Rtsne', 'caret', 'class', 'e1071', 
             'rpart', 'rpart.plot', 'randomForest', 'xgboost', 'doParallel', 
             'kernlab', 'ranger')
installed_packages = packages %in% rownames(installed.packages())
if (any(installed_packages == F)) {
  install.packages(packages[!installed_packages])
}
# suppressing warnings
options(warn = -1)
suppressMessages(invisible(lapply(packages, library, character.only = T)))

Data Preparation

Everything to know about the data is covered here.

Data Source

The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.

A subset of the data had already been extracted and made available on GitHub, and the analysis is performed on that subset. The song database contains each song, its popularity, its artist, and the album it belongs to, across 6 main genre categories (EDM, Latin, Pop, R&B, Rap, & Rock) spanning Jan 1957 to Jan 2020.

Data Dictionary

feature data_type description
track_id character Song unique ID
track_name character Song Name
track_artist character Song Artist
track_popularity double Song Popularity (0-100) where higher is better
track_album_id character Album unique ID
track_album_name character Song album name
track_album_release_date character Date when album released
playlist_name character Name of playlist
playlist_id character Playlist ID
playlist_genre character Playlist genre
playlist_subgenre character Playlist subgenre
danceability double Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy double Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key double The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness double The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode double Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness double Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness double A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness double Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness double Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence double A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo double The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms double Duration of song in milliseconds
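The speechiness bands described in the dictionary (thresholds at 0.33 and 0.66) can be turned into labels directly; a small sketch, using invented sample values:

```r
# bucketing speechiness values into the three bands from the data dictionary
speechiness = c(0.05, 0.45, 0.80)  # sample values for illustration
bands = cut(speechiness,
            breaks = c(-Inf, 0.33, 0.66, Inf),
            labels = c("music", "music + speech", "spoken word"))
as.character(bands)  # "music" "music + speech" "spoken word"
```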

Raw Dataset

URL = paste0("https://raw.githubusercontent.com/", 
             "rfordatascience/tidytuesday/master/data/", 
             "2020/2020-01-21/spotify_songs.csv")

spotify_df = fread(URL)
datatable(spotify_df, extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = I('colvis')))

Data Cleaning

The data cleaning process involved the following steps:-

  1. Checking data types for all columns and coercing them into more intuitive types where needed
  2. Looking for null values
  3. Looking at song durations to ensure the tracks qualify as “songs”
  4. Looking for duplicates, not across all columns but on a customized subset that cleans the data for better analysis
  5. Adding new columns, and manipulating or removing existing ones

Investigating data types

str(spotify_df)
## Classes 'data.table' and 'data.frame':   32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Investigating missing rows

# checking missing values for each columns
colSums(is.na(spotify_df))

# counting total number of missing rows and removing them
missing_rows = spotify_df[rowSums(is.na(spotify_df)) > 0, ]
spotify_df = spotify_df[complete.cases(spotify_df), ]

Investigating song duration

# removing songs whose duration is too long or too short
duration_out = boxplot(spotify_df$duration_ms, 
                             plot = F, range = 3)$out
spotify_df = spotify_df[!spotify_df$duration_ms %in% duration_out, ]
nrow(spotify_df[spotify_df$duration_ms < 60000, ])
spotify_df = spotify_df[spotify_df$duration_ms > 60000, ]
rm(duration_out)

Removing columns

# removing playlist name and id: playlist names are given by users,
# are highly subjective, and add little information
unique(spotify_df$playlist_name)
spotify_df = spotify_df[, -c("playlist_name", "playlist_id")]

Investigating duplicates

# checking duplicates based on track name, artist and release date
duplicate_rows = spotify_df[duplicated(
  spotify_df[, c("track_name", "track_artist", "track_album_release_date")])]
duplicate_rows = duplicate_rows[order(track_name), ]

Adding new columns and manipulating existing ones

# returns the mode of categorical values
# genre: ties for the maximum count are all returned, comma-separated
# subgenre: a single value is returned
mode = function(x, type) {
  if(type == 'subgenre') {
    cat_values = as.factor(x)
    levels(cat_values)[which.max(tabulate(cat_values))] 
  } else {
    cat_values = as.factor(x)
    temp = table(x)
    paste(levels(cat_values)[which(temp == max(temp))], collapse = ',')
  }
}

# aggregating different genre and subgenre in a single row separated by comma
# adding columns for the mode of genre and subgenre for each track
spotify_df = spotify_df %>%
  group_by(track_name, track_artist, track_album_release_date) %>%
  mutate(playlist_genre_new = map_chr(
    playlist_genre, ~toString(setdiff(playlist_genre, .x))), 
    playlist_subgenre_new = map_chr(
    playlist_subgenre, ~toString(setdiff(playlist_subgenre, .x))), 
    genre_mode = mode(playlist_genre, "genre"), 
    subgenre_mode = mode(playlist_subgenre, "subgenre")) %>%
  ungroup()

spotify_df = unite(spotify_df, "playlist_genre", 
               c("playlist_genre", "playlist_genre_new"), 
               sep = ",")
spotify_df = unite(spotify_df, "playlist_subgenre", 
               c("playlist_subgenre", "playlist_subgenre_new"), 
               sep = ",")
spotify_df$playlist_genre = gsub(",$", "", spotify_df$playlist_genre)
spotify_df$playlist_subgenre = gsub(",$", "", spotify_df$playlist_subgenre)

spotify_df = spotify_df[!duplicated(spotify_df[c("track_name", "track_artist", 
                                                 "track_album_release_date")]), ]

# separating date to year, month and day
# assuming the day to be the 1st of the month where missing
# assuming the month to be Jan where month is missing
spotify_df = separate(spotify_df, col = track_album_release_date, 
                      into = c("year", "month", "day"), sep = "-")
colSums(is.na(spotify_df))
spotify_df[is.na(spotify_df)] = "01"
spotify_df[c("year","month", "day")] = sapply(spotify_df[c("year","month", "day")], 
                                               as.integer)

# resolving a multi-genre mode to a single genre using euclidean distance:
# each track is assigned the genre whose median audio features are closest to its own
audio_features = colnames(spotify_df)[12:23]
single_genre_df = filter(spotify_df, !grepl(",", genre_mode))
multi_genre_df = filter(spotify_df, grepl(",", genre_mode))
median_df = single_genre_df %>%
  select(c('genre_mode', all_of(audio_features))) %>%
  group_by(genre_mode) %>%
  summarise_if(is.numeric, median) %>%
  ungroup()
for(i in 1:nrow(multi_genre_df)) {
  temp = multi_genre_df[i, c('genre_mode', audio_features)]
  multi_genres = strsplit(temp$genre_mode, ",")[[1]]
  dist_vector = c()
  for(j in 1:length(multi_genres)) {
    median_values = filter(median_df, genre_mode == multi_genres[j])
    eucli_dist = dist(rbind(temp[, audio_features], 
                            median_values[, audio_features])[, -c(12)])[1]
    dist_vector = append(dist_vector, eucli_dist)
  }
  multi_genre_df$genre_mode[i] = multi_genres[which(dist_vector == min(dist_vector))]
}
spotify_df = rbind(single_genre_df, multi_genre_df)
rm(median_df, median_values, multi_genre_df, single_genre_df, temp)

Cleaned Dataset

Following is the frequency count of genres.

Genre Count
edm 4969
latin 4170
pop 4406
r&b 4590
rap 5094
rock 4166

After data cleaning, the final dataset looks like below:-

Exploratory Data Analysis

This section deals with exploring audio features, song genre, and track popularity.

Audio features distribution and outliers

Distribution of audio features

# checking the distribution of audio features
plot_list = list()
for (i in 1:length(audio_features)) {
  plot_list[[i]] = ggplot(spotify_df, aes_string(x = audio_features[i])) + 
    geom_density(color = "darkblue", fill = "lightblue") 
}
do.call(grid.arrange, 
        c(plot_list, list(top = "Density distribution of audio features")))

Boxplot representation of audio features

# checking for outliers in audio features
plot_list = list()
for (i in 1:length(audio_features)) {
  plot_list[[i]] = ggplot(spotify_df, aes_string(y = audio_features[i])) + 
    geom_boxplot(color = "darkblue", fill = "lightblue", outlier.colour = "red", 
                 outlier.shape = 1, outlier.alpha = 0.5)
}
do.call(grid.arrange, 
        c(plot_list, list(top = "Boxplot representation of audio features")))

Default percentage of outliers based on boxplot

# getting default percentage of outliers based on boxplot
for (i in 1:length(audio_features)) {
  out_percent = length(boxplot(spotify_df[, c(audio_features[i])], 
                               plot = F, range = 1.5)
                       $out) * 100 / nrow(spotify_df)
  print(paste0("The outlier percentage for ", 
               audio_features[i], " is ", round(out_percent, 2), "%"))
}
## [1] "The outlier percentage for danceability is 0.89%"
## [1] "The outlier percentage for energy is 0.77%"
## [1] "The outlier percentage for key is 0%"
## [1] "The outlier percentage for loudness is 2.97%"
## [1] "The outlier percentage for mode is 0%"
## [1] "The outlier percentage for speechiness is 9.56%"
## [1] "The outlier percentage for acousticness is 6.7%"
## [1] "The outlier percentage for instrumentalness is 21.5%"
## [1] "The outlier percentage for liveness is 5.73%"
## [1] "The outlier percentage for valence is 0%"
## [1] "The outlier percentage for tempo is 1.71%"
## [1] "The outlier percentage for duration_ms is 3.64%"

Boxplot representation of audio features based on genre

# checking the range in audio features based on genre
plot_list = list()
for (i in 1:length(audio_features)) {
  plot_list[[i]] = ggplot(spotify_df, 
                          aes_string(x = "genre_mode", y = audio_features[i])) + 
    geom_boxplot(color = "darkblue", fill = "lightblue", outlier.colour = "red", 
                 outlier.shape = 1, outlier.alpha = 0.5) + 
      xlab("genre")
}
do.call(grid.arrange, 
        c(plot_list, list(top = "Boxplot representation of audio features for each genre")))

Correlation among audio features with genre and popularity

Correlation of audio features

# correlation between audio features
spotify_df %>%
  select(all_of(audio_features)) %>%
  scale() %>%
  cor() %>%
  corrplot::corrplot(method = 'color', 
                     order = 'hclust', 
                     type = 'upper', 
                     diag = FALSE, 
                     tl.col = 'black',
                     addCoef.col = "grey30",
                     number.cex = 0.6,
                     main = 'Correlation among audio features',
                     mar = c(2,2,2,2),
                     family = 'Avenir')

Audio features over the years

# audio features over the years
# taking mean for each year
year_features_df = spotify_df %>%
  select(c('year', all_of(audio_features))) %>%
  group_by(year) %>%
  summarise_if(is.numeric, mean) %>%
  ungroup()
plot_list = list()
for (i in 1:length(audio_features)) {
    plot_list[[i]] = ggplot(year_features_df, 
                            aes_string(x = "year", y = audio_features[i])) + 
      geom_line() 
}
do.call(grid.arrange, 
        c(plot_list, list(top = "Audio features over the years")))

Correlation among genres

# correlation within genre
# getting median values for each genre and finding correlation between them
genre_audio_df = spotify_df %>%
  select(c('genre_mode', all_of(audio_features)), -c(mode, key)) %>%
  group_by(genre_mode) %>%
  summarise_if(is.numeric, median) %>%
  ungroup() 
genre_audio_df = select(genre_audio_df, -genre_mode)
# scaling the values for better correlation mapping
genre_audio_df = scale(genre_audio_df)
genre_audio_df = t(genre_audio_df)
colnames(genre_audio_df) = sort(unique(spotify_df$genre_mode))

genre_audio_df %>%
  cor() %>%
  corrplot::corrplot(method = 'color', 
                     order = 'hclust', 
                     type = 'upper', 
                     diag = FALSE, 
                     tl.col = 'black',
                     addCoef.col = "grey30",
                     main = 'Correlation among genres',
                     mar = c(2,2,2,2),
                     family = 'Avenir',
                     number.cex=0.85)

Correlation of popularity with audio features

# correlation of popularity with audio features
popularity_features_df = spotify_df %>%
  select(c('track_popularity', all_of(audio_features))) %>%
  group_by(track_popularity) %>%
  summarise_if(is.numeric, mean) %>%
  ungroup()
plot_list = list()
for (i in 1:length(audio_features)) {
  plot_list[[i]] = ggplot(popularity_features_df, aes_string(x = "track_popularity", 
                                                 y = audio_features[i])) + 
    geom_point(shape = 18, color = 4) +
    geom_smooth(method = lm,  linetype = "dashed", color = "darkred", se = F) + 
      xlab("popularity")
}
suppressMessages(do.call(grid.arrange, 
                         c(plot_list, list(top = "Correlation of popularity with audio features"))))

Genre and popularity analysis

Popularity of genres over the years

# popularity of genres over the years
year_genre_features_df = spotify_df %>%
  select(c('year', 'genre_mode', 'track_popularity')) %>%
  group_by(year, genre_mode) %>%
  summarise_if(is.numeric, mean) %>%
  ungroup()
genres = sort(unique(spotify_df$genre_mode))  # genre labels present in the data
plot_list = list()
for (i in 1:length(genres)) {
  temp = filter(year_genre_features_df, genre_mode == genres[i])
  if (nrow(temp) > 0) {
    plot_list[[i]] = ggplot(temp, 
                            aes_string(x = "year", y = "track_popularity")) + 
      geom_line() + 
      ggtitle(paste0("Popularity trend for ", genres[i])) +
      ylab("popularity")
  }
}
do.call(grid.arrange, plot_list)

Impact of holiday season on any genre

# analysis if holiday season impacts any particular genre
popularity_month_df = spotify_df %>%
  select('month', 'genre_mode', 'track_popularity') %>%
  group_by(month, genre_mode) %>%
  summarise(popularity = mean(track_popularity)) %>%
  ungroup()
ggplot(popularity_month_df, aes(x = month, y = popularity)) +
  geom_line(aes(color = genre_mode)) + theme_bw() + 
    ggtitle("Month-wise popularity of genres") + 
      ylab("popularity") + labs(color = "Genre")

Trend for songs

Number of songs released over the years

# trend for number of songs released
song_count_df = spotify_df %>%
  select('year') %>%
  filter(year <= 2019) %>%
  group_by(year) %>%
  summarise(songs_released = n()) %>%
  ungroup()
ggplot(song_count_df, 
       aes_string(x = "year", y = "songs_released")) + 
  geom_line() + 
    ggtitle("Number of songs released over the years") + 
      ylab("songs released")

Number of songs released for each genre in the last 10 years

# number of songs released for each genre in the last 10 years
song_count_df = spotify_df %>%
  select('year', 'genre_mode') %>%
  filter(year > 2009 & year <= 2019) %>%
  group_by(year, genre_mode) %>%
  summarise(songs_released = n()) %>%
  ungroup()
ggplot(song_count_df, aes(x = year, y = songs_released)) +
  geom_line(aes(color = genre_mode)) + theme_bw() +
    ggtitle("Number of songs released in the last 10 years for each genre") + 
      ylab("songs released") + labs(color = "Genre")

Modeling

In this modeling section, we attempt to classify genre based on the audio features.

Feature selection and standardization

Data Filtering

# filtering out data before 1970 due to different spikes as observed in the EDA
spotify_df = filter(spotify_df, (year > 1970))
summary(spotify_df[, audio_features])
##   danceability        energy              key            loudness      
##  Min.   :0.0771   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5620   1st Qu.:0.579000   1st Qu.: 2.000   1st Qu.: -8.264  
##  Median :0.6710   Median :0.722000   Median : 6.000   Median : -6.237  
##  Mean   :0.6544   Mean   :0.698464   Mean   : 5.373   Mean   : -6.790  
##  3rd Qu.:0.7600   3rd Qu.:0.843000   3rd Qu.: 9.000   3rd Qu.: -4.699  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness       instrumentalness   
##  Min.   :0.0000   Min.   :0.0224   Min.   :0.0000014   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0411   1st Qu.:0.0144000   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0630   Median :0.0798000   Median :0.0000195  
##  Mean   :0.5622   Mean   :0.1086   Mean   :0.1775430   Mean   :0.0901270  
##  3rd Qu.:1.0000   3rd Qu.:0.1350   3rd Qu.:0.2600000   3rd Qu.:0.0061300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9920000   Max.   :0.9940000  
##     liveness          valence            tempo         duration_ms    
##  Min.   :0.00936   Min.   :0.00001   Min.   : 35.48   Min.   : 60447  
##  1st Qu.:0.09280   1st Qu.:0.32800   1st Qu.: 99.97   1st Qu.:187492  
##  Median :0.12700   Median :0.51100   Median :121.99   Median :216240  
##  Mean   :0.19108   Mean   :0.50953   Mean   :120.93   Mean   :224521  
##  3rd Qu.:0.24900   3rd Qu.:0.69300   3rd Qu.:134.00   3rd Qu.:253600  
##  Max.   :0.99600   Max.   :0.99100   Max.   :239.44   Max.   :450907

Feature Selection and Standardization

# taking only audio features for genre classification
spotify_df = spotify_df %>%
  select(c('genre_mode'), all_of(audio_features))
colnames(spotify_df)[1] = "genre"

# scaling audio features
spotify_df = spotify_df %>%
  mutate_if(is.numeric, scale)

Multicollinearity sanity-check

logistic_model = multinom(genre ~., data = spotify_df)
# checking for multicollinearity
vif(logistic_model)
##     danceability           energy              key         loudness 
##         2.523998         4.692985         1.847775         4.208273 
##             mode      speechiness     acousticness instrumentalness 
##         1.825805         2.340076         3.927155         1.478525 
##         liveness          valence            tempo      duration_ms 
##         1.633940         2.501153         2.104063         1.801746

TSNE cluster analysis

# observing inter-distance of genres and intra-distance within genres
# TSNE algorithm shows very poor clustering
temp = spotify_df %>%
  mutate(ID = row_number())
tsne_fit = temp %>%
  select('ID', all_of(audio_features)) %>%
  column_to_rownames("ID") %>%
  Rtsne(check_duplicates = F)
tsne_df = tsne_fit$Y %>% 
  as.data.frame() %>%
  rename(tSNE1 = "V1", tSNE2 = "V2") %>%
  mutate(ID = row_number())
tsne_df = tsne_df %>%
  inner_join(temp, by = "ID")
tsne_df %>%
  ggplot(aes(x = tSNE1, 
             y = tSNE2,
             color = genre)) +
  geom_point() +
  theme(legend.position = "bottom")

Predicting genre classification

Train-Test split

# train-test split
index = createDataPartition(spotify_df$genre, p = 0.7, list = F)
train_df = spotify_df[index,]
test_df = spotify_df[-index,]

Logistic Regression

# Logistic Regression
logistic_model = multinom(genre ~., data = train_df)
predicted_genre = predict(logistic_model, newdata = train_df[-1])
confusionMatrix(data = predicted_genre, 
                reference = as.factor(train_df$genre))$overall[1]
##  Accuracy 
## 0.4731843
predicted_genre = predict(logistic_model, test_df[-1])
confusionMatrix(data = predicted_genre, reference = as.factor(test_df$genre))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction edm latin pop r&b rap rock
##      edm   894   171 241  63 174  115
##      latin 115   442 164 159 167   54
##      pop   189   134 331 133  85   86
##      r&b    63   185 178 581 221  143
##      rap   130   201 123 249 823   26
##      rock   99   117 284 182  57  782
## 
## Overall Statistics
##                                          
##                Accuracy : 0.4721         
##                  95% CI : (0.4612, 0.483)
##     No Information Rate : 0.1871         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3655         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity              0.6000      0.35360    0.25057    0.42502     0.5390
## Specificity              0.8855      0.90464    0.90833    0.88372     0.8901
## Pos Pred Value           0.5392      0.40145    0.34551    0.42378     0.5303
## Neg Pred Value           0.9083      0.88555    0.86256    0.88424     0.8935
## Prevalence               0.1826      0.15317    0.16187    0.16750     0.1871
## Detection Rate           0.1095      0.05416    0.04056    0.07119     0.1008
## Detection Prevalence     0.2032      0.13491    0.11739    0.16799     0.1902
## Balanced Accuracy        0.7427      0.62912    0.57945    0.65437     0.7145
##                      Class: rock
## Sensitivity              0.64842
## Specificity              0.89375
## Pos Pred Value           0.51414
## Neg Pred Value           0.93614
## Prevalence               0.14778
## Detection Rate           0.09582
## Detection Prevalence     0.18637
## Balanced Accuracy        0.77109
# calculating p-value
z = summary(logistic_model)$coefficients / summary(logistic_model)$standard.errors
p = (1 - pnorm(abs(z), 0, 1)) * 2
print(p)
##        (Intercept) danceability    energy       key     loudness         mode
## latin 8.659740e-15   0.55939680 0.0000000 0.3701239 1.458917e-05 2.863748e-05
## pop   0.000000e+00   0.00000000 0.0000000 0.4889659 1.318093e-03 1.504019e-06
## r&b   1.629307e-05   0.00000000 0.0000000 0.6680378 1.283080e-09 3.233572e-01
## rap   3.390621e-13   0.01329863 0.0000000 0.8392161 2.174205e-04 8.476298e-02
## rock  0.000000e+00   0.00000000 0.8517028 0.5638046 0.000000e+00 0.000000e+00
##       speechiness acousticness instrumentalness     liveness valence
## latin  0.00267331 0.000000e+00                0 2.752921e-06       0
## pop    0.00000000 3.396404e-09                0 5.052625e-12       0
## r&b    0.00000000 0.000000e+00                0 8.066808e-06       0
## rap    0.00000000 1.669294e-08                0 5.642404e-02       0
## rock   0.00000000 2.091391e-04                0 1.831729e-07       0
##              tempo duration_ms
## latin 0.000000e+00  0.05473688
## pop   2.509104e-14  0.19080674
## r&b   0.000000e+00  0.00000000
## rap   5.280221e-13  0.42707376
## rock  0.000000e+00  0.00000000
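The p-values above come from two-sided Wald tests on the multinomial coefficients. A minimal, self-contained sketch of that computation (the coefficient and standard error below are assumed toy values, not model output):

```r
# Two-sided Wald test: z = coefficient / standard error,
# p = 2 * (1 - pnorm(|z|)) under the standard normal.
coef_hat = 1.3     # toy coefficient (assumption)
se_hat   = 0.5     # toy standard error (assumption)
z = coef_hat / se_hat
p = 2 * (1 - pnorm(abs(z)))
round(p, 4)        # small p => coefficient significantly different from 0
```

Features whose p-values stay above 0.05 across classes (key, mode and, for several genres, duration_ms) are dropped in the refit below.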
# Logistic Regression without key, mode and duration_ms
logistic_model = multinom(genre ~., data = train_df[-c(4, 6, 13)])
predicted_genre = predict(logistic_model, test_df[-c(1, 4, 6, 13)])
confusionMatrix(data = predicted_genre, reference = as.factor(test_df$genre))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction edm latin pop r&b rap rock
##      edm   896   177 255  63 175  119
##      latin 127   438 153 189 175   58
##      pop   173   120 301 153  83   97
##      r&b    54   195 223 518 195  138
##      rap   122   204 107 267 841   23
##      rock  118   116 282 177  58  771
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4613          
##                  95% CI : (0.4505, 0.4722)
##     No Information Rate : 0.1871          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3525          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity              0.6013      0.35040    0.22786    0.37893     0.5508
## Specificity              0.8817      0.89842    0.90848    0.88151     0.8910
## Pos Pred Value           0.5318      0.38421    0.32470    0.39153     0.5377
## Neg Pred Value           0.9083      0.88435    0.85900    0.87584     0.8960
## Prevalence               0.1826      0.15317    0.16187    0.16750     0.1871
## Detection Rate           0.1098      0.05367    0.03688    0.06347     0.1031
## Detection Prevalence     0.2065      0.13969    0.11359    0.16211     0.1916
## Balanced Accuracy        0.7415      0.62441    0.56817    0.63022     0.7209
##                      Class: rock
## Sensitivity              0.63930
## Specificity              0.89202
## Pos Pred Value           0.50657
## Neg Pred Value           0.93448
## Prevalence               0.14778
## Detection Rate           0.09447
## Detection Prevalence     0.18650
## Balanced Accuracy        0.76566

K-NN

predicted_genre = knn(train = train_df[, -1],
                      test = test_df[, -1],
                      cl = train_df$genre,
                      k = 5)
confusionMatrix(data = predicted_genre, reference = as.factor(test_df$genre))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction edm latin pop r&b rap rock
##      edm   906   152 249  74 148  103
##      latin 117   455 204 184 190   76
##      pop   223   190 376 198 118  174
##      r&b    57   182 180 515 241  153
##      rap    92   182 101 252 778   36
##      rock   95    89 211 144  52  664
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4526          
##                  95% CI : (0.4418, 0.4635)
##     No Information Rate : 0.1871          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3422          
##                                           
##  Mcnemar's Test P-Value : 0.005594        
## 
## Statistics by Class:
## 
##                      Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity              0.6081      0.36400    0.28463    0.37674    0.50950
## Specificity              0.8912      0.88844    0.86798    0.88034    0.90006
## Pos Pred Value           0.5551      0.37113    0.29398    0.38780    0.53990
## Neg Pred Value           0.9106      0.88536    0.86269    0.87531    0.88854
## Prevalence               0.1826      0.15317    0.16187    0.16750    0.18711
## Detection Rate           0.1110      0.05575    0.04607    0.06311    0.09533
## Detection Prevalence     0.2000      0.15023    0.15672    0.16273    0.17657
## Balanced Accuracy        0.7496      0.62622    0.57631    0.62854    0.70478
##                      Class: rock
## Sensitivity              0.55058
## Specificity              0.91503
## Pos Pred Value           0.52908
## Neg Pred Value           0.92152
## Prevalence               0.14778
## Detection Rate           0.08136
## Detection Prevalence     0.15378
## Balanced Accuracy        0.73280
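A note on k-NN: it classifies by Euclidean distance, so features on large numeric scales (e.g. tempo, duration_ms) can dominate the neighbor search unless the columns were standardized during data cleaning. A minimal sketch of such standardization, using `iris` purely as a toy stand-in for the numeric feature matrix:

```r
# Standardize columns to mean 0, sd 1 before distance-based methods like k-NN
X = as.matrix(iris[, 1:4])   # toy numeric feature matrix (stand-in)
X_scaled = scale(X)
colMeans(X_scaled)           # ~0 for every column
apply(X_scaled, 2, sd)       # 1 for every column
```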

SVM

svm_model = svm(as.factor(genre) ~ ., data = train_df, kernel = "radial")
predicted_genre = predict(svm_model, newdata = train_df[-1])
confusionMatrix(data = predicted_genre, 
                reference = as.factor(train_df$genre))$overall[1]
##  Accuracy 
## 0.5733627
predicted_genre = predict(svm_model, newdata = test_df[-1])
confusionMatrix(data = predicted_genre, reference = as.factor(test_df$genre))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  edm latin  pop  r&b  rap rock
##      edm   1000   126  220   53  112   65
##      latin   71   457  117  125  114   35
##      pop    191   190  453  142   76  128
##      r&b     49   169  184  647  161  141
##      rap    109   240  126  276 1022   23
##      rock    70    68  221  124   42  814
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5383          
##                  95% CI : (0.5274, 0.5492)
##     No Information Rate : 0.1871          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4444          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity              0.6711       0.3656    0.34292    0.47330     0.6693
## Specificity              0.9137       0.9332    0.89371    0.89638     0.8833
## Pos Pred Value           0.6345       0.4973    0.38390    0.47890     0.5690
## Neg Pred Value           0.9256       0.8905    0.87566    0.89427     0.9207
## Prevalence               0.1826       0.1532    0.16187    0.16750     0.1871
## Detection Rate           0.1225       0.0560    0.05551    0.07928     0.1252
## Detection Prevalence     0.1931       0.1126    0.14459    0.16554     0.2201
## Balanced Accuracy        0.7924       0.6494    0.61832    0.68484     0.7763
##                      Class: rock
## Sensitivity              0.67496
## Specificity              0.92451
## Pos Pred Value           0.60792
## Neg Pred Value           0.94254
## Prevalence               0.14778
## Detection Rate           0.09974
## Detection Prevalence     0.16407
## Balanced Accuracy        0.79974

Decision Tree

dt_model = rpart(genre ~ ., data = train_df)
predicted_genre = predict(dt_model, newdata = train_df[-1])
predicted_genre = colnames(predicted_genre)[apply(predicted_genre, 1, which.max)]
confusionMatrix(data = as.factor(predicted_genre), 
                reference = as.factor(train_df$genre))$overall[1]
##  Accuracy 
## 0.4109992
predicted_genre = predict(dt_model, newdata = test_df[-1], type = "class")
confusionMatrix(data = predicted_genre, reference = as.factor(test_df$genre))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  edm latin  pop  r&b  rap rock
##      edm    629   125  184   52   45   60
##      latin   61   316  181  135  126   33
##      pop     85   154  224  117   73   95
##      r&b     50   120   96  303   78  136
##      rap    401   436  328  529 1103  204
##      rock   264    99  308  231  102  678
## 
## Overall Statistics
##                                          
##                Accuracy : 0.3986         
##                  95% CI : (0.388, 0.4093)
##     No Information Rate : 0.1871         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.2749         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity             0.42215      0.25280    0.16957    0.22165     0.7223
## Specificity             0.93015      0.92244    0.92339    0.92935     0.7139
## Pos Pred Value          0.57443      0.37089    0.29947    0.38697     0.3675
## Neg Pred Value          0.87815      0.87221    0.85202    0.85579     0.9178
## Prevalence              0.18258      0.15317    0.16187    0.16750     0.1871
## Detection Rate          0.07707      0.03872    0.02745    0.03713     0.1352
## Detection Prevalence    0.13417      0.10440    0.09166    0.09594     0.3677
## Balanced Accuracy       0.67615      0.58762    0.54648    0.57550     0.7181
##                      Class: rock
## Sensitivity              0.56219
## Specificity              0.85564
## Pos Pred Value           0.40309
## Neg Pred Value           0.91851
## Prevalence               0.14778
## Detection Rate           0.08308
## Detection Prevalence     0.20610
## Balanced Accuracy        0.70892
rpart.plot(dt_model, 
           type = 5, 
           extra = 104,
           box.palette = list(purple = "#490B32",
                              red = "#9A031E",
                              orange = '#FB8B24',
                              dark_blue = "#0F4C5C",
                              blue = "#5DA9E9",
                              grey = '#66717E'),
           leaf.round = 0,
           fallen.leaves = FALSE, 
           branch = 0.3, 
           under = TRUE,
           under.col = 'grey40',
           family = 'Avenir',
           main = 'Genre Decision Tree',
           tweak = 1.2)

Random Forest

rf_model = randomForest(as.factor(genre) ~ ., data = train_df,  
                        ntree = 500, importance = T)
predicted_genre = predict(rf_model, newdata = train_df[-1])
confusionMatrix(data = predicted_genre, 
                reference = as.factor(train_df$genre))$overall[1]
##  Accuracy 
## 0.9962217
predicted_genre = predict(rf_model, newdata = test_df[-1])
confusionMatrix(data = predicted_genre, reference = as.factor(test_df$genre))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  edm latin  pop  r&b  rap rock
##      edm   1059   109  215   37   91   43
##      latin   69   517  133  106  110   33
##      pop    188   171  450  133   57  123
##      r&b     46   173  187  668  163  119
##      rap     81   215  117  293 1060   23
##      rock    47    65  219  130   46  865
## 
## Overall Statistics
##                                           
##                Accuracy : 0.566           
##                  95% CI : (0.5551, 0.5768)
##     No Information Rate : 0.1871          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4778          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity              0.7107      0.41360    0.34065    0.48866     0.6942
## Specificity              0.9258      0.93474    0.90175    0.89873     0.8901
## Pos Pred Value           0.6815      0.53409    0.40107    0.49263     0.5925
## Neg Pred Value           0.9348      0.89810    0.87626    0.89728     0.9267
## Prevalence               0.1826      0.15317    0.16187    0.16750     0.1871
## Detection Rate           0.1298      0.06335    0.05514    0.08185     0.1299
## Detection Prevalence     0.1904      0.11861    0.13748    0.16616     0.2192
## Balanced Accuracy        0.8183      0.67417    0.62120    0.69370     0.7921
##                      Class: rock
## Sensitivity               0.7172
## Specificity               0.9271
## Pos Pred Value            0.6305
## Neg Pred Value            0.9498
## Prevalence                0.1478
## Detection Rate            0.1060
## Detection Prevalence      0.1681
## Balanced Accuracy         0.8222

XGBoost

xgb_model = xgboost(data = as.matrix(train_df[-1]), 
                    label = as.integer(as.factor(train_df$genre)),
                    nrounds = 25,
                    verbose = FALSE, 
                    params = list(objective = "multi:softmax",
                                  num_class = 6 + 1))
predicted_genre = predict(xgb_model, newdata = as.matrix(train_df[-1]))
predicted_genre = levels(as.factor(train_df$genre))[predicted_genre]
confusionMatrix(data = as.factor(predicted_genre), 
                reference = as.factor(train_df$genre))$overall[1]
##  Accuracy 
## 0.6870802
predicted_genre = predict(xgb_model, newdata = as.matrix(test_df[-1]))
predicted_genre = levels(as.factor(test_df$genre))[predicted_genre]
confusionMatrix(data = as.factor(predicted_genre), 
                reference = as.factor(test_df$genre))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  edm latin  pop  r&b  rap rock
##      edm   1056   111  196   38   76   44
##      latin   62   501  124  135  122   36
##      pop    201   212  467  133   89  136
##      r&b     55   166  182  644  179  127
##      rap     72   198  129  286 1018   31
##      rock    44    62  223  131   43  832
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5536          
##                  95% CI : (0.5427, 0.5644)
##     No Information Rate : 0.1871          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.463           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity              0.7087      0.40080    0.35352    0.47110     0.6667
## Specificity              0.9303      0.93069    0.88728    0.89564     0.8921
## Pos Pred Value           0.6943      0.51122    0.37722    0.47598     0.5871
## Neg Pred Value           0.9346      0.89570    0.87664    0.89380     0.9208
## Prevalence               0.1826      0.15317    0.16187    0.16750     0.1871
## Detection Rate           0.1294      0.06139    0.05722    0.07891     0.1247
## Detection Prevalence     0.1864      0.12008    0.15170    0.16579     0.2125
## Balanced Accuracy        0.8195      0.66575    0.62040    0.68337     0.7794
##                      Class: rock
## Sensitivity               0.6899
## Specificity               0.9277
## Pos Pred Value            0.6232
## Neg Pred Value            0.9452
## Prevalence                0.1478
## Detection Rate            0.1019
## Detection Prevalence      0.1636
## Balanced Accuracy         0.8088
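One quirk worth noting in the XGBoost fit above: the `multi:softmax` objective expects 0-based integer class labels, while `as.integer(as.factor(...))` produces codes 1..6, which is why the code declares `num_class = 6 + 1` (leaving class 0 unused). A sketch of the more conventional 0-based encoding, on a toy factor rather than the project data:

```r
# 0-based class codes for xgboost's multi:softmax objective
genre = factor(c("edm", "latin", "pop", "r&b", "rap", "rock"))
labels = as.integer(genre) - 1L   # codes 0..5 instead of 1..6
labels
# with this encoding the model would be fit with num_class = 6
```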

Feature Importance of Decision Tree, Random Forest and XGBoost

# comparing feature importance between Decision Tree, Random Forest and XGBoost
importance_dt = data.frame(importance = dt_model$variable.importance)
importance_dt$feature = row.names(importance_dt)
importance_rf = data.frame(importance = randomForest::importance(rf_model, type = 2))
importance_rf$feature = row.names(importance_rf)
importance_xgb = xgb.importance(model = xgb_model)
compare_importance = importance_xgb %>%
  select(Feature, Gain) %>%
  left_join(importance_dt, by = c('Feature' = 'feature')) %>%
  left_join(importance_rf, by = c('Feature' = 'feature')) %>%
  rename('xgboost' = 'Gain',
         'decision_tree' = 'importance',
         'random_forest' = 'MeanDecreaseGini')
compare_importance = compare_importance %>%
  mutate_if(is.numeric, scale) %>%
  pivot_longer(cols = c('xgboost', 'decision_tree', 'random_forest')) %>%
  rename('model' = 'name')
ggplot(compare_importance, aes(x = reorder(Feature, value, na.rm = T), y = value, color = model)) + 
  geom_point(size = 2) + 
  coord_flip() +
  labs(title = 'Variable Importance by Model',
       y = 'Scaled value', x = '')

Hyperparameter Tuning

Please note that hyperparameter tuning took about an hour to run even after parallelizing, so the code below has been commented out; uncomment it to rerun the tuning. Also, Random Search was used instead of Grid Search to keep the search computationally affordable, so the resulting hyperparameters will most likely differ between runs.

Creating parallel processing

# cl = makePSOCKcluster(6)
# registerDoParallel(cl)
# fitControl = trainControl(search = 'random', method = "repeatedcv",
#                           number = 5, repeats = 3, allowParallel = T)

SVM

# svm_fit = train(genre ~ ., data = train_df,
#                 method = "svmRadial",
#                 trControl = fitControl,
#                 verbose = F,
#                 tuneLength = 2)
# print(svm_fit)

Random Forest

# rf_fit = train(genre ~ ., data = train_df,
#                method = "ranger",
#                trControl = fitControl,
#                verbose = F,
#                tuneLength = 5)
# print(rf_fit)

XGBoost

# xgb_fit = train(genre ~ ., data = train_df,
#                 method = "xgbTree",
#                 trControl = fitControl,
#                 verbose = F,
#                 tuneLength = 10)
# print(xgb_fit)
# stopCluster(cl)

Parameters that Random Search produced in one of the iterations:

  • SVM -> sigma = 0.01, C = 18.8
  • Random Forest -> min.node.size = 18, mtry = 6
  • XGBoost -> nrounds = 910, max_depth = 5, eta = 0.012, gamma = 3.8, colsample_bytree = 0.5, min_child_weight = 14, subsample = 0.85

Stacking

Using the hyperparameters found, we create a stacking layer from SVM, Random Forest and XGBoost, whose predictions are fed into a multinomial Logistic Regression meta-learner.

# caret's svmRadial reports sigma and C; the equivalent e1071::svm
# arguments are gamma and cost (sigma/C would be silently ignored)
svm_model = svm(as.factor(genre) ~ ., 
                data = train_df, 
                kernel = 'radial', 
                gamma = 0.01, 
                cost = 18.8)
svm_pred = predict(svm_model, newdata = train_df[-1])
svm_pred_test = predict(svm_model, newdata = test_df[-1])

# ranger's min.node.size corresponds to randomForest's nodesize
rf_model = randomForest(as.factor(genre) ~ ., 
                        data = train_df,  
                        ntree = 500, 
                        importance = T, 
                        mtry = 6, 
                        nodesize = 18)
rf_pred = predict(rf_model, newdata = train_df[-1])
rf_pred_test = predict(rf_model, newdata = test_df[-1])

xgb_model = xgboost(data = as.matrix(train_df[-1]), 
                    label = as.integer(as.factor(train_df$genre)),
                    nrounds = 910,
                    max_depth = 5,
                    eta = 0.012, 
                    gamma = 3.8,
                    colsample_bytree = 0.5, 
                    min_child_weight = 14, 
                    subsample = 0.85,
                    verbose = FALSE, 
                    params = list(objective = "multi:softmax",
                                  num_class = 6 + 1))
xgb_pred = predict(xgb_model, newdata = as.matrix(train_df[-1]))
xgb_pred = levels(as.factor(train_df$genre))[xgb_pred]
xgb_pred_test = predict(xgb_model, newdata = as.matrix(test_df[-1]))
xgb_pred_test = levels(as.factor(test_df$genre))[xgb_pred_test]

stacked_df = data.frame(svm = svm_pred, 
                        rf = rf_pred,
                        xgb = xgb_pred,
                        genre = train_df[1])
stacked_df_test = data.frame(svm = svm_pred_test, 
                        rf = rf_pred_test,
                        xgb = xgb_pred_test,
                        genre = test_df[1])

logistic_model = multinom(genre ~., data = stacked_df)
predicted_genre = predict(logistic_model, newdata = stacked_df[-4])
confusionMatrix(data = as.factor(predicted_genre), 
                reference = as.factor(stacked_df$genre))$overall[1]
##  Accuracy 
## 0.9962741
predicted_genre = predict(logistic_model, stacked_df_test[-4])
confusionMatrix(data = predicted_genre, 
                reference = as.factor(stacked_df_test$genre))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  edm latin  pop  r&b  rap rock
##      edm   1041   108  208   31   86   39
##      latin   70   518  144  116  106   38
##      pop    204   177  447  143   69  124
##      r&b     39   166  187  668  169  121
##      rap     84   214  123  289 1056   24
##      rock    52    67  212  120   41  860
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5624          
##                  95% CI : (0.5516, 0.5732)
##     No Information Rate : 0.1871          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4736          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity              0.6987      0.41440    0.33838    0.48866     0.6916
## Specificity              0.9292      0.93141    0.89518    0.89962     0.8894
## Pos Pred Value           0.6880      0.52218    0.38402    0.49481     0.5899
## Neg Pred Value           0.9325      0.89789    0.87509    0.89737     0.9261
## Prevalence               0.1826      0.15317    0.16187    0.16750     0.1871
## Detection Rate           0.1276      0.06347    0.05477    0.08185     0.1294
## Detection Prevalence     0.1854      0.12155    0.14263    0.16542     0.2193
## Balanced Accuracy        0.8140      0.67291    0.61678    0.69414     0.7905
##                      Class: rock
## Sensitivity               0.7131
## Specificity               0.9293
## Pos Pred Value            0.6361
## Neg Pred Value            0.9492
## Prevalence                0.1478
## Detection Rate            0.1054
## Detection Prevalence      0.1657
## Balanced Accuracy         0.8212

Summary

Genre

The genre classification model achieved an accuracy of 57%. Though well above a random guess ( \(1/6 \approx 17\)% ), the exercise gave a deeper understanding of the data space and of what could have been better:

  • The genres were labeled by end-users rather than by the platform itself, which made the labels highly subjective and dependent on each listener's musical understanding.
  • Since many different end-users contributed labels, the problem effectively became multi-label rather than single-label.
  • Genres overlap heavily when only audio features are considered. This could reflect a lack of subject-matter expertise among the users who labeled the songs, or it could be that modern music is defined less by genre and more by the mix of audio features a listener wants to hear.

speechiness, danceability and tempo were the main features that helped identify a few genres. Rap ( \(68\)% accuracy ) was associated with high speechiness, Rock ( \(68\)% accuracy ) with low danceability, and EDM ( \(72\)% accuracy ) with high tempo. Latin ( \(40\)% accuracy ), R&B ( \(47\)% accuracy ) and Pop ( \(36\)% accuracy ) were the most difficult genres to classify, though Latin was somewhat differentiated by high danceability and R&B by a relatively high duration_ms. The models had a hard time recognizing Pop as a genre because of its wide range on almost every audio feature, hence its lowest accuracy, sensitivity and specificity.

Popularity

  • Loudness, energy, danceability and speechiness have increased over the last couple of decades, whereas acousticness and duration_ms have decreased. The perception of millennials and Gen Z seems to be borne out by the data itself :)
  • R&B is on the verge of decline while Rap as a genre has seen the most significant rise.
  • EDM sees the highest increase in popularity as the festive season approaches.
  • While EDM seems to be reaching the point of saturation (number of songs vs popularity), Rock is the most “pure” genre.